---
# id: guides-rag-triad
title: Using the RAG Triad for RAG evaluation
sidebar_label: RAG Triad
---

<head>
  <link rel="canonical" href="https://deepeval.com/guides/guides-rag-triad" />
</head>

Retrieval-Augmented Generation (RAG) is a powerful way for an LLM to generate responses based on context beyond the scope of its training data, by supplying it with external data as additional context. This supporting context comes in the form of text chunks, which are usually parsed, vectorized, and indexed in vector databases for fast retrieval at inference time, hence the name retrieval-augmented generation.

In a previous [guide](/guides/guides-rag-evaluation), we explored how the **generator** in a RAG pipeline can hallucinate despite being supplied additional context, while the **retriever** can often fail to retrieve the correct and relevant context to generate the optimal answer. This is why evaluating RAG pipelines is important, and it is where the RAG triad comes into play.

## What is the RAG Triad?

<div
  style={{
    marginTop: "40px",
    marginBottom: "40px",
    display: "flex",
    justifyContent: "center",
  }}
>
  <img
    id="rag-triad"
    src="https://d2lsxfc3p6r9rv.cloudfront.net/rag-triad.svg"
  />
</div>

The **RAG triad** is composed of three RAG evaluation metrics: answer relevancy, faithfulness, and contextual relevancy. If a RAG pipeline scores high on all three metrics, we can be reasonably confident that its hyperparameters are well chosen. This is because each metric in the RAG triad corresponds to one or more hyperparameters in the RAG pipeline. For instance:

- **Answer relevancy:** the answer relevancy metric determines how relevant the answers generated by your RAG generator are. Since LLMs nowadays are getting pretty good at reasoning, it is mainly the **prompt template** hyperparameter, rather than the LLM, that you are iterating on when working with the answer relevancy metric. To be more specific, a low answer relevancy score signifies that you need to improve the examples used in your prompt templates for better in-context learning, or include more fine-grained prompting for better instruction following, in order to generate more relevant responses.
- **Faithfulness:** the faithfulness metric determines the extent to which the answers generated by your RAG generator are hallucinated. This concerns the **LLM** hyperparameter, and you'll want to switch to a different LLM or even fine-tune your own if your LLM is unable to leverage the retrieval context supplied to it to generate grounded answers.

  :::info
  You might also see the faithfulness metric called groundedness in other places. They are the same thing under a different name.
  :::

- **Contextual Relevancy:** the contextual relevancy metric determines whether the text chunks retrieved by your RAG retriever are relevant to producing the ideal answer for a user input. This concerns the **chunk size**, **top-K**, and **embedding model** hyperparameters. A good embedding model ensures you're able to retrieve text chunks that are semantically similar to the embedded user query, while a good combination of chunk size and top-K ensures you only select the most important bits of information in your knowledge base.

:::caution
You might have noticed we didn't mention the contextual precision and contextual recall metrics. This is because contextual precision and recall require a labelled expected answer (i.e. the ideal answer to a user input), which may not be available to everyone. This guide therefore serves as a fully referenceless RAG evaluation guide.
:::
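
The metric-to-hyperparameter mapping described above can be sketched as a small lookup. This is only an illustration: `TRIAD_HYPERPARAMETERS` and `hyperparameters_to_revisit` are hypothetical names that are not part of `deepeval`, and the `0.5` cutoff is an assumed value rather than a recommendation.

```python
# Hypothetical helper, not part of deepeval: maps each RAG triad metric
# to the hyperparameters this guide associates with it.
TRIAD_HYPERPARAMETERS = {
    "answer_relevancy": ["prompt template"],
    "faithfulness": ["LLM"],
    "contextual_relevancy": ["chunk size", "top-K", "embedding model"],
}


def hyperparameters_to_revisit(scores, threshold=0.5):
    """Return the hyperparameters linked to every metric scoring below `threshold`.

    `scores` maps metric names to floats in [0, 1]. The default threshold is an
    illustrative assumption, not a recommended value.
    """
    to_revisit = []
    for metric, hyperparameters in TRIAD_HYPERPARAMETERS.items():
        if scores[metric] < threshold:
            to_revisit.extend(hyperparameters)
    return to_revisit


# Made-up scores: only contextual relevancy falls below the threshold.
scores = {"answer_relevancy": 0.9, "faithfulness": 0.8, "contextual_relevancy": 0.3}
print(hyperparameters_to_revisit(scores))
# → ['chunk size', 'top-K', 'embedding model']
```

In this made-up example the generator looks healthy, so the retriever settings would be the first place to look.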

## Using the RAG Triad in DeepEval

Using the RAG triad of metrics in `deepeval` is as simple as writing a few lines of code. First, create a test case to represent a user query, retrieved text chunks, and an LLM response:

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(input="...", actual_output="...", retrieval_context=["..."])
```

Here, `input` is the user query, `actual_output` is the LLM-generated response, and `retrieval_context` is a list of strings representing the retrieved text chunks. Then, define the RAG triad metrics:

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric

...
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
contextual_relevancy = ContextualRelevancyMetric()
```

:::tip
You can find how these metrics are implemented and calculated on their respective documentation pages:

- [`AnswerRelevancyMetric`](/docs/metrics-answer-relevancy)
- [`FaithfulnessMetric`](/docs/metrics-faithfulness)
- [`ContextualRelevancyMetric`](/docs/metrics-contextual-relevancy)

:::

Lastly, evaluate your test case using these metrics:

```python
from deepeval import evaluate

...
evaluate(test_cases=[test_case], metrics=[answer_relevancy, faithfulness, contextual_relevancy])
```

Congratulations 🎉! You've learnt everything you need to know for the RAG triad.

## Scaling RAG Evaluation

As you scale up your RAG evaluation efforts, you can simply supply more test cases to the list of `test_cases` in the [`evaluate()` function](/docs/evaluation-introduction#evaluating-without-pytest). More importantly, you can also [generate synthetic datasets using `deepeval`](/guides/guides-using-synthesizer) to test your RAG application at scale.
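
As a minimal sketch of supplying more test cases, the snippet below turns logged RAG interactions into `LLMTestCase` objects. The record keys (`query`, `response`, `chunks`) and both helper functions are hypothetical and should be adapted to however your application stores its interactions. Only `LLMTestCase` and its `input`, `actual_output`, and `retrieval_context` arguments come from `deepeval`.

```python
def to_test_case_kwargs(record):
    """Map one logged RAG interaction to LLMTestCase keyword arguments.

    The record keys ("query", "response", "chunks") are hypothetical; rename
    them to match however your application logs its interactions.
    """
    return {
        "input": record["query"],
        "actual_output": record["response"],
        "retrieval_context": list(record["chunks"]),
    }


def build_test_cases(records):
    """Build one LLMTestCase per record (requires deepeval to be installed)."""
    # Imported lazily so the mapping above can be used without deepeval.
    from deepeval.test_case import LLMTestCase

    return [LLMTestCase(**to_test_case_kwargs(record)) for record in records]


# Placeholder records only; in practice these come from your own logs.
records = [
    {"query": "...", "response": "...", "chunks": ["..."]},
    {"query": "...", "response": "...", "chunks": ["...", "..."]},
]
kwargs_list = [to_test_case_kwargs(record) for record in records]
```

The list returned by `build_test_cases(records)` can then be passed as `test_cases` to `evaluate()` together with the three RAG triad metrics defined earlier.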
